Abstract

Scientific workflow management systems offer features for composing complex 
computational pipelines from modular building blocks, for executing the resulting 
automated workflows, and for recording the provenance of data products resulting 
from workflow runs. Despite the advantages such features provide, many automated 
workflows continue to be implemented and executed outside of scientific workflow 
systems due to the convenience and familiarity of scripting languages (such as Perl, 
Python, R, and MATLAB), and to the high productivity many scientists experience
when using these languages. YesWorkflow is a set of software tools that aim
to provide such users of scripting languages with many of the benefits of scientific 
workflow systems. YesWorkflow requires neither the use of a workflow engine nor the 
overhead of adapting code to run effectively in such a system. Instead, YesWorkflow 
enables scientists to annotate existing scripts with special comments that reveal the 
computational modules and dataflows otherwise implicit in these scripts. YesWorkflow
tools extract and analyze these comments, represent the scripts in terms of
entities based on the typical scientific workflow model, and provide graphical renderings
of this workflow-like view of the scripts. Future versions of YesWorkflow
also will allow the prospective provenance of the data products of these scripts to 
be queried in ways similar to those available to users of scientific workflow systems.

1 Introduction


Many scientists use scripts (written, e.g., in Python, R, or MATLAB) or scientific workflow
environments for data processing, analysis, model simulation, result visualization,
and other scientific computing tasks. In addition to their widespread use in the natural
sciences, computational automation tools are also increasingly used in other domains,
e.g., for data mining workflows in the digital humanities [VZ12], or to implement data
curation workflows for natural history collections [DCM+12]. One advantage of using scientific
workflow systems (e.g., Galaxy [GNT10], Kepler [LAB+06], Taverna [OAF+04],
VisTrails [BCC+05], RestFlow [MM10, TMnG+13]) is that they often include capabilities
to track data as it is being processed. By capturing and subsequently sharing such
provenance information, scientists can provide a detailed account of how their results
were derived from the given inputs via intermediate results, workflow steps, and parameter
settings, thereby facilitating transparency and reproducibility of workflow products.
In addition to this external use, provenance information can also be used internally, e.g.,
to allow scientists to trace sources of errors and to debug their workflows.

The data provenance captured by workflow environments is sometimes called retrospective
provenance to distinguish it from another form called prospective provenance
[CFV+08, LLCF10]. The former consists of data dependencies and lineage information
recorded at runtime, which can then be used later for retrospective exploration and
analysis (a.k.a. “querying provenance” [DF08]). In contrast, prospective provenance is
a description of the computational process itself, i.e., the workflow specification is considered
a form of provenance information, describing the method by which analysis results
and other data products are obtained. Scientific workflow systems therefore naturally
support both forms of provenance, i.e., prospective provenance by visually presenting a
workflow as a directed graph with data and process steps, and retrospective provenance
by capturing and subsequently exporting runtime provenance.

Despite these and other advanced features of workflow systems, a vast number of
computational “workflows” continue to be developed using general purpose or specialized
scripting languages such as Python, R, and MATLAB. This is true in particular for
the “long tail of science” [WRB13, Hei08], where advanced features such as provenance
support are rarely available. For example, provenance libraries for R have only recently
been announced [LB14], while for Python, a new tool called noWorkflow has just been
developed [MBC+14]. The noWorkflow (not only workflow) system uses Python runtime
profiling functions to generate provenance traces that reflect the processing history
of the script. Thus, noWorkflow allows users to continue working in their familiar Python
scripting environment, without adopting a new system, while retaining the advantage of
automatic capture of retrospective provenance information similar to that available
in workflow systems.

In the following, we describe a new tool called YesWorkflow that complements noWorkflow
by revealing prospective provenance in scripts, i.e., YesWorkflow makes latent workflow
information from scripts explicit. In particular, dataflow dependencies that are often
“hidden” inside of a script and not easily understood by outsiders looking at the script
are extracted from simple user annotations and can then be exported and visualized in
graph form.

The main features of YesWorkflow (or YW for short) are:

• YW exposes prospective provenance (workflow structure and dataflow dependencies)
from scripts based on simple user annotations.

• YW annotations are embedded inside of comments, so they are language independent
and can be used, e.g., in Python, R, and MATLAB.

• YW annotations and the underlying model are deliberately kept simple to allow
scientists a very low entry bar for adoption.

• The YW-toolkit is a grass-roots, agile, open source effort, whose simple and modular
architecture and underlying UNIX philosophy facilitate interoperability and
extensibility.

• The current YW prototype generates different, easily reusable output formats, including
three different graph views, i.e., a process-centric, a data-centric, and a
combined view of the extracted workflow graph in Graphviz/DOT form.

We discuss YW limitations and plans for future development in Section 7. 


2 YesWorkflow Model and Annotation Syntax 

In order to use the YesWorkflow tools, a script author marks up scripts using a simple
keyword-based annotation or tagging mechanism, embedded within the comments of the
host language. YW annotations are expressions of the form @tag value. Here, @tag
is one of the recognized YW keywords, after which a value follows, separated by one
or more whitespace characters. Thus, the YW annotation syntax mimics the syntax of
conventional documentation generators such as Javadoc and Doxygen.

The YW tool then interprets the embedded, structured comments and builds a simple 
workflow model of the script. This model represents scripts in terms of scientific workflow 
entities, i.e., programs, workflows, ports, and channels: 

• A program block (short: program or block) represents a computational step in the
script that receives input data and produces (intermediate or final) output data.
A program is designated in a script by bracketing the relevant code between a pair
of @begin and @end comments. Program blocks are usually visualized as boxes. A
block that contains other programs is considered a workflow.

• A port represents a way in which data flows into or out of a program or workflow.
Ports are identified by @in and @out annotations in the source code comments.

• A channel is a connection between an @out port of a program and an @in port of
another (or, in case of feedback loops, the same) program. YW infers channels by
matching the names of @in and @out ports within the same workflow.
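
The channel-matching rule above can be illustrated with a minimal Python sketch. This is not the YW implementation (the prototype is written primarily in Java); the names parse_blocks, infer_channels, and Block, the assumption that comments start with "#", and the sample script are all hypothetical and introduced only for illustration. Only the four tags described above are handled.

```python
import re
from dataclasses import dataclass, field

# Matches one annotation of the form "@tag value" (assumed tag set: begin, end, in, out).
ANNOTATION = re.compile(r"@(begin|end|in|out)\b\s*(\S*)")


@dataclass
class Block:
    """A program block; a block with children plays the role of a workflow."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def parse_blocks(script_text):
    """Build a tree of blocks from annotations found after '#' on each line."""
    root = Block("script")
    stack = [root]
    for line in script_text.splitlines():
        comment = line.partition("#")[2]  # assumption: '#' starts a comment
        for tag, value in ANNOTATION.findall(comment):
            if tag == "begin":
                block = Block(value)
                stack[-1].children.append(block)
                stack.append(block)
            elif tag == "end":
                if len(stack) == 1:
                    raise ValueError("@end without matching @begin")
                stack.pop()
            elif tag == "in":
                stack[-1].inputs.append(value)
            else:  # tag == "out"
                stack[-1].outputs.append(value)
    if len(stack) != 1:
        raise ValueError("unclosed @begin: " + stack[-1].name)
    return root


def infer_channels(workflow):
    """Connect each @out port to every same-named @in port within one workflow."""
    channels = []
    for producer in workflow.children:
        for name in producer.outputs:
            for consumer in workflow.children:  # may be the producer itself (feedback loop)
                if name in consumer.inputs:
                    channels.append((producer.name, name, consumer.name))
    return channels


# Hypothetical annotated script, used only to exercise the sketch.
SCRIPT = """
# @begin main
# @in raw_nee
# @out clean_nee
# @begin load_data
# @in raw_nee
# @out nee_table
table = read(raw_nee)
# @end load_data
# @begin standardize
# @in nee_table
# @out clean_nee
clean_nee = fix(table)
# @end standardize
# @end main
"""

main = parse_blocks(SCRIPT).children[0]
print(infer_channels(main))
```

Because matching is done by name among the children of a single workflow, a port of a nested block is never connected to a block in a different workflow, which mirrors the "within the same workflow" restriction stated above.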

Figure 1 depicts a workflow view extracted from a sample Python script for standardizing 
Net Ecosystem Exchange (NEE) data in the MsTMIP project; cf. Section 4.2.

Figure 1: Process-oriented workflow view of a script: boxes represent programs (code
blocks); edges represent dataflow channels; edge labels indicate data elements.


Alternative Workflow Views. The process-oriented view in Figure 1 is the default
YW view shown to the user, as it emphasizes the overall block structure, given by the
script author using @begin and @end markers. However, the extracted YW model can
also be rendered in other forms. For example, Figure 2 depicts a data-oriented view,
where data elements (i.e., dataflow channels obtained from @in and @out tags) are shown
as nodes, while programs are only mentioned in edge labels. Finally, Figure 3 shows a
combined workflow view, i.e., in which both programs and data channels are represented
as nodes.

Figure 2: Data-oriented workflow view: program blocks are mentioned in edge labels
only, while data channels are exposed as proper graph nodes.

Figure 3: Combined workflow view of a script: both programs and data are nodes.
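
To make the difference between the three views concrete, the following Python sketch renders each of them as Graphviz DOT text from a list of channels. It is an illustration under stated assumptions, not the output format of YW-Graph: the CHANNELS data, the function names, and the node shapes are hypothetical. The rankdir graph attribute is a standard Graphviz attribute that selects the layout direction (e.g., LR for left-to-right, TB for top-down).

```python
# Hypothetical channels, each given as (producer block, data name, consumer block).
CHANNELS = [
    ("load_data", "nee_table", "standardize"),
    ("standardize", "clean_nee", "write_output"),
]


def _dot(nodes, edges, rankdir):
    """Serialize (name, shape) nodes and (src, dst, label) edges as DOT text."""
    lines = ["digraph workflow {", f"  rankdir={rankdir};"]
    for name, shape in nodes:
        lines.append(f'  "{name}" [shape={shape}];')
    for src, dst, label in edges:
        attr = f' [label="{label}"]' if label else ""
        lines.append(f'  "{src}" -> "{dst}"{attr};')
    lines.append("}")
    return "\n".join(lines)


def _programs(channels):
    return sorted({p for p, _, _ in channels} | {c for _, _, c in channels})


def _data(channels):
    return sorted({d for _, d, _ in channels})


def process_view(channels, rankdir="LR"):
    """Programs are boxes; each channel is an edge labeled with its data name."""
    nodes = [(p, "box") for p in _programs(channels)]
    edges = [(p, c, d) for p, d, c in channels]
    return _dot(nodes, edges, rankdir)


def data_view(channels, rankdir="LR"):
    """Data elements are nodes; an edge is labeled with the program that
    reads the source data element and writes the target data element."""
    nodes = [(d, "ellipse") for d in _data(channels)]
    edges = [
        (d_in, d_out, consumer)
        for _, d_in, consumer in channels
        for producer, d_out, _ in channels
        if producer == consumer
    ]
    return _dot(nodes, edges, rankdir)


def combined_view(channels, rankdir="LR"):
    """Both programs and data elements are nodes; edges carry no labels."""
    nodes = [(p, "box") for p in _programs(channels)]
    nodes += [(d, "ellipse") for d in _data(channels)]
    edges = []
    for producer, d, consumer in channels:
        for edge in ((producer, d, ""), (d, consumer, "")):
            if edge not in edges:
                edges.append(edge)
    return _dot(nodes, edges, rankdir)


print(process_view(CHANNELS))
print(data_view(CHANNELS))
print(combined_view(CHANNELS, rankdir="TB"))
```

The resulting text can be passed to the Graphviz dot program for rendering. The sketch assumes that program names and data names do not collide; a real tool would need to keep the two namespaces apart.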


3 Querying YesWorkflow Models 

The workflow structure of large scripts can be difficult to interpret fully even when 
represented graphically. While the YW prototype is limited to such graphical views, the 
YW comments and model are sufficient to support queries that reveal specific aspects of 
the script in workflow terms. Example workflow-structure queries that will be supported 
by YesWorkflow include: 

• List all of the code blocks defined in the script along with any description given 
for each. 

• List the code blocks nested (directly or indirectly) within a particular code block. 

• List the code blocks that invoke a particular function or external program.

• List the code blocks that contain a particular block (directly or indirectly).

• List the code blocks that receive inputs derived (directly or indirectly) from the 
outputs of a particular upstream code block. 

• List the code blocks affected (directly or indirectly) by a particular parameter value 
provided to the script. 

Prospective Data Provenance Queries. YesWorkflow additionally will allow scripts 
marked up with YW comments to be queried from a data provenance perspective. Because
YesWorkflow analyzes the definition of a workflow (the script plus YW comments)
rather than information recorded during a run of the script, YesWorkflow will support 
queries against prospective provenance. Example prospective provenance queries include: 

• Given the name of an output of the script, list the inputs to the script that the 
output depends on (directly or indirectly). 

• List the computational steps (code blocks) involved in deriving a particular output 
of the script, or of a named intermediate data product. 

• For a particular computational step reveal where each input to the step comes from: 
an input to the script, a constant in the script, a value produced by a different step, 
etc. 

• Reveal the complete derivation of a particular script output. That is, list the 
sequence of code blocks and input and intermediate data products leading to the 
output. Results of queries of this kind optionally may be rendered graphically. 
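
Queries of this kind amount to reachability over the extracted model. As a minimal sketch (the planned YW-Query module may work quite differently), the Python code below answers the first, second, and fourth queries for a small model. The BLOCKS dictionary, its block and data names, and the functions derivation and script_inputs are assumptions made for illustration.

```python
# Hypothetical model: each block maps to the data names it reads and writes.
BLOCKS = {
    "load_data":    {"in": ["raw_file"],              "out": ["table"]},
    "standardize":  {"in": ["table", "fill_value"],   "out": ["clean_table"]},
    "write_output": {"in": ["clean_table"],           "out": ["result_file"]},
    "make_log":     {"in": ["raw_file"],              "out": ["log_file"]},
}


def derivation(blocks, target):
    """Return (block names, data names) that `target` depends on,
    directly or indirectly, by walking channels backwards."""
    producers = {}
    for name, ports in blocks.items():
        for data in ports["out"]:
            producers.setdefault(data, []).append(name)

    seen_blocks, seen_data = set(), set()
    pending = [target]
    while pending:
        data = pending.pop()
        for block in producers.get(data, []):
            if block in seen_blocks:
                continue  # also guards against feedback loops
            seen_blocks.add(block)
            for upstream in blocks[block]["in"]:
                if upstream not in seen_data:
                    seen_data.add(upstream)
                    pending.append(upstream)
    return seen_blocks, seen_data


def script_inputs(blocks, target):
    """Upstream data that no block produces, i.e., inputs to the script itself."""
    produced = {d for ports in blocks.values() for d in ports["out"]}
    _, data = derivation(blocks, target)
    return sorted(data - produced)


steps, data = derivation(BLOCKS, "result_file")
print(sorted(steps))
print(script_inputs(BLOCKS, "result_file"))
```

Note that make_log does not appear in the derivation of result_file even though it shares an input with load_data; only blocks on a dataflow path to the requested output are reported.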

Inference of Retrospective Data Provenance. As described above, YesWorkflow
will allow prospective provenance to be inferred from scripts marked up with YW comments.
We additionally foresee that combining the information extracted from a marked-up
script with references to data files corresponding to a run of that script will in some
cases allow the retrospective provenance of those files to be inferred (see also [BML12]
and [ZL10]). That is, in cases where the entire sequence of data derivation steps for a particular
output can be determined unambiguously from YW annotations, YesWorkflow will
support queries of the following kind even in the absence of a run-time data-provenance
recorder:

• Given a file output by a run of a script, indicate which files input to the script this 
output file was derived from (or affected by). 

• Given an input file to a script, indicate which output files were derived from (or
affected by) the data contained in that file.

• Indicate which parameter values applied to a run of the script affected which of its 
output files. 





Figure 4: Process workflow view of an Affymetrix analysis script (in R). 


4 YesWorkflow Examples 

In the following we show YesWorkflow views extracted from real-world scientific use cases.
The scripts were annotated with YW tags by scientists and script authors, using a very
modest training and mark-up effort.¹ Due to lack of space, the actual MATLAB and R
scripts with their YW markup are not included here. However, they are all available
from the yw-idcc-15 repository on the YW GitHub site [Yes15].

4.1 Analysis of Gene Expression Microarray Data 

Bioinformatics workflows commonly possess a pattern of large numbers of incoming parameters
and outputs at each stage of computation. In addition, analysis of even a
single bioinformatics dataset tends to yield a large number of different output files.
Hence, bioinformatics pipelines are attractive candidates for workflow systems, which
can capture this complexity [Bie12]. Figure 4 shows a YesWorkflow representation of
an R script performing a classic, complex bioinformatics task: analysis of Affymetrix
gene expression microarray data. This R script was modeled on our previous workflows
developed in the Kepler environment [SMLB12]. The script analyzes experiment
designs consisting of two conditions (e.g., microarrays from control-treated cells vs.
microarrays from drug-treated cells) with multiple replicates in each condition. The R
script employs a set of standard BioConductor [GCB+04] packages mixed with custom
programming. The workflow consists of four fundamental tasks: normalization of data
across microarray datasets (Normalize), selection of differentially expressed genes (DEGs)
between conditions (SelectDEGs), determination of gene ontology (GO) statistics for the
resulting datasets (GO_Analysis), and creation of a heatmap of the differentially expressed
genes (MakeHeatmap). Each module produces outputs, and each module (aside
from MakeHeatmap) requires external parameter inputs. Importantly, this graphical representation
clearly indicates the dependence of each module on datasets and parameter
inputs. This example demonstrates that YesWorkflow can provide informative visualizations
of bioinformatics workflows, especially workflows involving large numbers of inputs
and outputs.

1 For all of these scripts, learning the YW model and annotating the scripts was done in a few hours.

Figure 5: Combined workflow view of a MsTMIP script (in MATLAB). YW views can
be easily tweaked via Graphviz properties in the generated DOT files: here, a “Taverna-style”
top-down layout is used, as opposed to the default left-to-right display.

4.2 Terrestrial Biospheric Modeling 

In the Multi-scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP)²,
climate scientists primarily use MATLAB scripts to standardize terrestrial biosphere
model output across multiple models and simulation runs for intercomparison purposes
and to facilitate diagnosis and attribution. MsTMIP is a large, collaborative effort, aimed
at harmonizing a number of complex terrestrial biospheric models for the purposes of
comparing these model outputs [HSM+13]. There is a strong need to standardize many
aspects of the MsTMIP process, to assure greater uniformity in the treatment of the codes
and outputs of the disparate models in the intercomparison analyses. Current practice
in MsTMIP, however, is representative of many scientific investigations, i.e., researchers
develop their codes with a specific focus on functionality and efficiency. Comments are
added primarily as “bookmarks” to assist with accessing appropriate code areas for debugging,
optimization, or discussion. In the more general case, depending on whether the
codes are developed in a collaborative context, structured in-code documentation may
be recommended or required by the project. Nevertheless, the mechanisms for these
“code annotations” are typically unformalized and unstructured, and rely primarily on
the ability to insert non-executable “comment” statements in the code.

As the complexity of code grows, and the numbers of variants and alternative approaches
increase, MsTMIP researchers need a clear and consistent way to document,
review, and share their model intercomparison scripts. This provides a compelling use
case for YW, in that MsTMIP brings together models from a number of independent
efforts that require harmonization into a single framework for evaluating their relative
capabilities to predict critical earth system features, such as global Net Ecosystem Exchange
(NEE) data from terrestrial biogeographic realms.

2 http://nacp.ornl.gov/MsTMIP.shtml

Figure 6: Combined workflow view of a paleoclimate reconstruction R script [BK14].

4.3 Paleoclimate Reconstruction 

As another working example from a different field, we have used the YesWorkflow markup 
syntax to analyze the paleoclimate reconstruction workflow presented by Bocinsky and 
Kohler [BK14]. Their reconstruction method takes as input a spatial interpolation of 
contemporary weather data, the long-term record of climate held in regional tree-ring 
chronologies, and a handful of parameters, and uses a novel regression-based analysis 
method to generate spatial reconstructions of climate extending 2000 years or more back 
in time. Figure 6 shows that the YW system nicely exposes the prospective provenance 
hidden in the underlying R script, even for scripts whose workflow views are highly 
non-linear. 

5 YW Architecture 


The YesWorkflow software distribution is envisioned as a set of standard modules that can
be used together or independently. The primary goal of this modularity is to enable YW
users and developers independently to implement alternatives to any module, as needed,
to solve problems particular to their research domain. It will be possible to develop these
alternative implementations and extensions in any programming language. One way we
plan to facilitate such easy replacement of YW modules is to require that each standard
module optionally input and output files (with well-defined formats) representing the expected
inputs or outputs of that module. Any program that produces or consumes these
file formats can then function as an alternative to one or more standard YW modules and
can provide identical, overlapping, or completely different capabilities (e.g., the current
YW prototype is primarily implemented in Java, but also contains some alternative YW
modules implemented in Python).

Five standard modules (implemented in Java) currently are implemented or planned: 
The YW-Extract module identifies YW comments in a script and produces a language- 
independent representation of the script and the YW annotations. YW-Model interprets 
the comments identified by YW-Extract and builds a model of the script in terms of 
entities analogous to the components of a traditional scientific workflow as described in 
Section 2, while YW-Graph operates on the outputs of YW-Model to produce the dataflow
graphs discussed in that same section. As described in Section 3, the planned YW-Query 
module will allow users to probe the structure of a complex script without having to 
inspect a visual representation of it. An envisioned YW-Validate module will ensure 
that YW comments in a script are consistent both with the other YW comments in the 
script and with the script itself. Finally, the YW-CLI module enables a user to execute 
sequences of the standard modules, starting from an input file with format appropriate 
to the first module in the executed sequence. 
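
The idea of modules that exchange well-defined intermediate files can be sketched as follows. This Python code is only an illustration of the design principle, not the YW code base: the stage functions extract, model, and graph are simplified stand-ins for YW-Extract, YW-Model, and YW-Graph, and the use of JSON as the interchange format is an assumption (the formats used by the actual modules are not specified here).

```python
import json


def extract(script_text):
    """Stand-in for YW-Extract: collect '@tag value' pairs from the script text."""
    annotations = []
    for line in script_text.splitlines():
        if "@" not in line:
            continue
        words = line.split("@", 1)[1].split()
        if words:
            annotations.append({"tag": words[0], "value": " ".join(words[1:])})
    return {"annotations": annotations}


def model(extracted):
    """Stand-in for YW-Model: group ports under their block (nesting is ignored)."""
    blocks, current = [], None
    for a in extracted["annotations"]:
        if a["tag"] == "begin":
            current = {"name": a["value"], "in": [], "out": []}
            blocks.append(current)
        elif a["tag"] == "end":
            current = None
        elif a["tag"] in ("in", "out") and current is not None:
            current[a["tag"]].append(a["value"])
    return {"blocks": blocks}


def graph(modeled):
    """Stand-in for YW-Graph: emit a DOT edge wherever an output name matches an input name."""
    lines = ["digraph workflow {"]
    for producer in modeled["blocks"]:
        for consumer in modeled["blocks"]:
            for name in producer["out"]:
                if name in consumer["in"]:
                    lines.append(
                        f'  "{producer["name"]}" -> "{consumer["name"]}" [label="{name}"];'
                    )
    lines.append("}")
    return "\n".join(lines)


def run_pipeline(stages, data, dump=None):
    """Run stages in order; `dump` optionally receives each intermediate result as JSON text."""
    for stage in stages:
        data = stage(data)
        if dump is not None and not isinstance(data, str):
            dump(stage.__name__, json.dumps(data, sort_keys=True))
    return data


# Hypothetical two-block script.
SCRIPT = "# @begin fetch\n# @out rows\n# @end fetch\n# @begin plot\n# @in rows\n# @end plot\n"

files = {}  # stands in for intermediate files written to disk
dot = run_pipeline([extract, model, graph], SCRIPT, dump=files.__setitem__)

# A replacement graph module, possibly written in another language, only needs
# to read the serialized model; here the same function is reused to show this.
same_dot = graph(json.loads(files["model"]))
print(dot)
```

Because every stage boundary is a serializable document, any single stage can be swapped for a different implementation as long as it reads and writes the same structure.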

6 Related Work 

The YW approach can be seen in the tradition of programming code annotation, which
is widely used for facilitating code understanding and for generating documentation
(e.g., Doxygen³, Epydoc⁴, Javadoc⁵). YW builds on programming code annotation
to provide a higher level of abstraction by revealing the dataflow that underlies the
interactions between the different pieces of a script or program.

YW is also related to ideas from literate programming⁶, which are available in tools such as
Knitr [Xie13] and IPython [PG07]. In literate programming, a script is decomposed into
snippets of macros, which are interspersed within documents that are written in natural
language to explain the scripts and eventually analyze the results they generate upon
execution. While borrowing ideas from literate programming, YW is primarily targeted
at developers who are using pure traditional scripting environments to edit their scripts
and programs. YW aims at providing a consistent interpretation and visualization of
codes wherever the language provides for insertion of non-executable “comments”.

3 www.doxygen.org

4 epydoc.sourceforge.net

5 www.oracle.com/technetwork/java/javase/documentation/index-jsp-135444.html

6 Don Knuth has argued [Knu84] that we should change our traditional attitude to programming:
“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather
on explaining to human beings what we want a computer to do”.

YW can also contribute to the area of reproducible computational research [SLP14], 
which seeks to provide scientists with sufficient information to understand and eventually 
validate the results claimed by their peers. For instance, the SOLE system [PMF+12] 
allows linking articles with science objects, which can be source code, a dataset, or a 
workflow. SOLE allows the reader (curator) to specify human-readable tags that link 
the paper with science objects, and it transforms each tag into a URI that points to 
a representation of the corresponding object. While in SOLE the scientific article is 
the main object that contains links to other (science) objects, we focus on the scripts 
produced by the scientists, and aim to facilitate the understanding of their dataflow 
logic. Gavish and Donoho [GD11] present the notion of a Verifiable Computational 
Result (VCR), where every result is assigned a unique identifier, and results produced 
under the exact same conditions have the same identifier to support reproducibility. 

Various tools have been proposed to capture the runtime provenance of scripts. Mechanisms
that capture provenance at the operating system level [FMS08, GS12, MRHBS06]
monitor system calls to track the data dependencies between computational processes.
Some tools [BGS08, Dav12, BAW13, MBC+14] have been developed to capture runtime
provenance for Python scripts: while Bochner et al. [BGS08] and Davison [Dav12] propose
Python libraries and APIs that need to be added to the code to capture the execution
steps, ProvenanceCurious [BAW13] and noWorkflow [MBC+14] are transparent and do
not require changes to the scripts. Similarly, RDataTracker [LB14] captures provenance
from the execution of R scripts, and the approach taken by Tariq et al. [TAG12] supports
all programming languages allowed by the LLVM compiler framework. We note that the
YW approach is complementary to these tools, since it captures prospective provenance
of scripts. We argue that YW, along with runtime provenance approaches, provides a
low-effort entry point for scientists who want to reap some of the benefits of scientific
workflow systems while still using their familiar scripting environments.


7 YesWorkflow Development Roadmap 

In the following we list some limitations of the current YW prototype and highlight 
features planned for future releases of the software. 

Visualization of Nested Code Blocks. The YW-Extract and YW-Model modules
support nesting of code blocks. Any pair of @begin and @end comment lines can enclose
code that contains any number of other code blocks delimited with @begin and @end
comment lines. The workflow model constructed for a script reflects such nesting, i.e.,
the top-level workflow corresponding to the script as a whole may contain one or more
programs (code blocks), and any of these programs can in turn be a sub-workflow that
contains further nested programs and workflows. Future versions of YW-Graph will reveal
these nested code blocks and render sub-workflows graphically.

Functions and Function Calls. YW-Extract currently expects nested code blocks to
be defined in-line. However, many scripts are structured as functions (or classes) with a
top-level script that calls these functions (or methods on objects). These functions can in
turn call other functions. Future versions of YesWorkflow will allow function declarations
to be marked up with YW comments in a manner similar to that supported by Javadoc
and Doxygen. Calls to these functions also will be annotated with YW markup. The
result will be that YW-Extract and YW-Model will be able to represent function calls as
nested code blocks.

Interactive Graphs. YW-Graph currently produces static graphical views (in the well-known
Graphviz DOT format). An interactive viewer for YW graphical output will make
these graphs easier to explore and interpret. In the planned graphical user interface, 
clicking on a data item in the combined or data views optionally will highlight the 
(prospective) direct and indirect data dependencies for that data item (the data from 
which it will be derived when the script is run). Features for expanding and collapsing 
nested subworkflows also will facilitate exploration of these graphs. 

Live Graph View. Although the primary function of YesWorkflow is to reveal workflow-like
structure in existing scripts, YesWorkflow also can be used as a design tool when developing
new scripts (or even before a script is written). Future versions of YesWorkflow
will better support such applications by providing live-update features to the interactive
graph capabilities described above. Given a set of script files, the live-graph feature
will monitor these files for changes and update the chosen graphical view automatically.
Users of this feature will continue to be able to use their favorite text editor or IDE for
developing their scripts.
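
One simple way such monitoring could be realized is by polling file modification times. The Python sketch below is an assumption about a possible mechanism, not a description of the planned feature; the function names snapshot, changed, and watch, and the on_change callback (which would re-run the extraction and graph rendering), are hypothetical.

```python
import os
import time


def snapshot(paths):
    """Map each existing path to its last-modification time in nanoseconds."""
    return {p: os.stat(p).st_mtime_ns for p in paths if os.path.exists(p)}


def changed(previous, current):
    """Paths that were added, removed, or modified between two snapshots."""
    return sorted(
        p for p in previous.keys() | current.keys()
        if previous.get(p) != current.get(p)
    )


def watch(paths, on_change, interval=1.0, max_polls=None):
    """Poll `paths` every `interval` seconds and call `on_change` with the
    list of changed paths; stop after `max_polls` polls if given."""
    previous = snapshot(paths)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(paths)
        dirty = changed(previous, current)
        if dirty:
            on_change(dirty)
        previous = current
        polls += 1
```

A call such as watch(["analysis.py"], regenerate_graph), with a user-supplied regenerate_graph function, would then refresh the view whenever the script is saved from any editor. Polling keeps the sketch portable; an operating-system file notification service would avoid the polling delay.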

Distinguished Data and Parameters. The inputs to scripts for processing scientific 
data often can be viewed either as data (the data to be processed by the scripts) or as 
parameters (values that control how that data is processed). Planned versions of the 
YW comment vocabulary will allow data and parameters to be distinguished. YW-Graph 
optionally will emphasize graph edges, nodes, and labels representing data over those 
representing parameters. 

Validation of Comments. The future YW-Validate module will perform extensive 
validation of YW comments in light of the actual code in the script. This capability 
will help guide users adding YW comments to their script. Perhaps more importantly, 
automatic validation will help prevent initially correct YW comments from becoming
stale (i.e., incorrect) when the underlying script is changed or refactored. Validity checks 
that YW-Validate will perform include: 

• Confirm that data names used in @in and @out comments actually appear in the
code bracketed by associated @begin and @end comments.

• Confirm that the names of functions referred to in YW comments for function 
declaration or for function calls match the names of the functions actually declared 
or called. 

• Confirm that continuous data dependency chains exist from each script output all 
the way back to script inputs (and embedded constants). 
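
The first of these checks can be sketched in a few lines of Python. Since YW-Validate is a future module, the code below is purely illustrative: the function name check_port_names, the assumption that "#" starts a comment, and the whole-word matching rule are choices made for this example, and the sketch would be confused by "#" characters inside string literals.

```python
import re

# One annotation per comment is assumed (tag set: begin, end, in, out).
ANNOTATION = re.compile(r"@(begin|end|in|out)\b\s*(\S*)")


def check_port_names(script_text):
    """Return (block, tag, name) for every @in/@out name that does not occur
    as a whole word in the code bracketed by its @begin/@end comments."""
    problems = []
    stack = []  # open blocks, innermost last
    for line in script_text.splitlines():
        code, _, comment = line.partition("#")
        for block in stack:
            block["code"].append(code)  # nested code also counts for outer blocks
        match = ANNOTATION.search(comment)
        if not match:
            continue
        tag, value = match.groups()
        if tag == "begin":
            stack.append({"name": value, "ports": [], "code": []})
        elif tag == "end":
            if not stack:
                raise ValueError("@end without matching @begin")
            block = stack.pop()
            text = "\n".join(block["code"])
            for port_tag, port in block["ports"]:
                if not re.search(rf"\b{re.escape(port)}\b", text):
                    problems.append((block["name"], port_tag, port))
        elif stack:
            stack[-1]["ports"].append((tag, value))
    return problems


# Hypothetical script in which the port "factor" has become stale.
SCRIPT = """
# @begin scale
# @in raw
# @out scaled
# @in factor
scaled = raw * 2
# @end scale
"""

print(check_port_names(SCRIPT))
```

A check like this catches the stale-comment case described above: if the variable factor is removed from the code during refactoring but its @in comment remains, the block is reported.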

8 Conclusions


YesWorkflow is an agile, grass-roots effort that aims at bringing workflow modeling and 
analysis features to scientific “workflows” that are defined in script form. Through simple 
user-annotations in the comments of scripts, dataflow and workflow structure are revealed 
by the YW toolkit. The user can thus exploit prospective provenance information from 
scripts, e.g., by visualizing, querying, and analyzing this information. 

Our early YW prototype [Yes15] has been used by scientists from different domains to
mark up complex, real-world scientific scripts with ease. Encouraged by the enthusiastic 
response of the early adopters, a number of researchers will be incorporating YW into 
their projects, thereby guiding and driving the future development of YesWorkflow. 

MsTMIP researchers plan to annotate their scripts such that authors, as well as 
reviewers and potential new users, will be able to click on the workflow steps in the 
interactive YW graph viewer and inspect the corresponding code-blocks in the original 
script. When clicking on data elements, they will be taken to a folder containing the 
data instances that were used in the various runs of the script (provided these have been 
shared). Since the YW approach is language independent, it will also facilitate code 
migration, say from MATLAB to R, or from R to Python. 

In the Kurator project [HLM14] we plan to enable collection managers to author their 
own data curation workflows using both an Akka-based workflow system and via scripting 
languages such as Python and R. In the latter case, Kurator tool users will annotate their 
scripts with YW comments to enable provenance queries to span script-based curation 
workflows. The Kurator team also plans to use the YW-Graph and YW-Query tools to
graphically render workflows defined using the Kurator-Akka workflow system and to 
query the prospective provenance of products of these workflows. 

Finally, DataONE is planning a number of enhancements to the YW annotation 
language. For example, in addition to the currently supported, simple user-defined vocabulary
for program blocks and data elements, controlled vocabularies from shared
ontologies may be used with these extensions. Similarly, to improve YW interoperability
within the DataONE infrastructure, PROV [MM13] and ProvONE [CVLM+15] compatible
vocabulary extensions may be used in YesWorkflow in the future.

Acknowledgements. Work supported in part by the National Science Foundation under
awards DBI-1356751 (Kurator), ACI-0830944 (DataONE), SMA-1439603 (SKOPE).